Abstract:This letter investigates an energy-efficient collaborative strategy for a mobile embodied artificial intelligence network (MEAN) over wireless communication. In the considered model, the agents execute tasks through collaboration and can switch between two operating modes based on the signal-to-noise ratio (SNR) and global collaboration. The two modes are the base station (BS)-assisted collaborative mode, in which agents make decisions through semantic communication with the BS and then collaborate on tasks, and the local computing mode, in which the agents make decisions and execute tasks independently. Due to the dynamic wireless communication and flexible collaboration strategy, we jointly incorporate computation energy, communication energy, and task-execution energy with specific collaborative gains into a mixed-integer nonlinear programming (MINLP) problem whose goal is to minimize the total system energy consumption. To solve it, we propose a low-complexity enumeration algorithm. First, we derive the optimal closed-form solution for the semantic compression ratio and transmit power by proving strict convexity. Second, we determine the scale of collaboration and the operating mode of each agent using a greedy sorting algorithm based on individual energy-saving potentials. Simulation results show that the proposed algorithm can significantly reduce the total energy consumption compared to benchmark schemes.
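The greedy sorting step described in the abstract above can be illustrated with a minimal sketch. It assumes each agent has a known local-mode energy and a known collaborative-mode energy, and that the collaborative gain is a multiplicative factor depending only on the number of collaborating agents. The names `choose_modes`, `e_local`, `e_collab`, and `gain` are hypothetical, and the energy model is a placeholder rather than the one used in the letter.

```python
def choose_modes(e_local, e_collab, gain):
    """Pick which agents use the collaborative mode.

    e_local[i]  : energy of agent i in local computing mode (assumed given)
    e_collab[i] : energy of agent i in BS-assisted mode before any gain (assumed given)
    gain(k)     : hypothetical multiplicative factor on collaborative energy
                  when k agents collaborate
    Returns (best_total_energy, set_of_collaborating_agents).
    """
    n = len(e_local)
    # Rank agents by individual energy-saving potential, largest saving first.
    order = sorted(range(n), key=lambda i: e_local[i] - e_collab[i], reverse=True)

    # Baseline: every agent stays in local computing mode.
    best_energy, best_set = sum(e_local), set()

    # Enumerate the collaboration scale k; for each k take the top-k ranked agents.
    for k in range(1, n + 1):
        chosen = set(order[:k])
        energy = sum(
            e_collab[i] * gain(k) if i in chosen else e_local[i]
            for i in range(n)
        )
        if energy < best_energy:
            best_energy, best_set = energy, chosen
    return best_energy, best_set


e_local = [5.0, 4.0, 3.0]
e_collab = [2.0, 3.5, 4.0]

# Without collaborative gain, only the agents with positive savings switch modes.
no_gain = choose_modes(e_local, e_collab, lambda k: 1.0)

# With a gain that grows with the scale, full collaboration becomes preferable.
with_gain = choose_modes(e_local, e_collab, lambda k: 1.0 - 0.1 * (k - 1))
```

Sorting once and then enumerating only the scale `k` needs `n` candidate evaluations instead of the `2^n` possible mode assignments, which is the general reason a greedy ranking lowers the enumeration cost.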
Abstract:Understanding why independently trained neural networks from different modalities converge toward shared representations, and where this convergence leads, remains an open question in representation learning. All existing evidence relies on symmetric similarity measures, which can detect convergence but are structurally blind to its direction. We introduce directional convergence analysis using cycle-kNN, an asymmetric alignment measure, applied across dozens of independently trained unimodal models spanning point clouds, vision, and language. We uncover a consistent directional asymmetry: non-language modalities move toward the neighborhood structure of language significantly more than the reverse, and this pattern holds across all model families and scales--yet is entirely invisible to symmetric measures. Mechanistic analysis traces the directionality to feature density asymmetry, whereby language representations occupy the most compact regions of representational space. The Information Bottleneck framework provides a principled interpretation: optimization under compression drives representations toward discrete, compositional structures characteristic of language. We formalize this as the Wittgensteinian Representation Hypothesis: the semantic structure of language is the asymptotic attractor of multimodal representation convergence.
Abstract:Acquiring the channel state information from limited and noisy observations at pilot positions is critical for wireless multiple-input multiple-output (MIMO)-orthogonal frequency division multiplexing (OFDM) systems. In this paper, we view this process as a conditional generative task in which the partial noisy channel estimates at the pilots are utilized as a ``prompt'' to guide the diffusion ``inpainting'' of the underlying channel. To this end, we resort to a general Conditional Diffusion Transformer (CDiT) framework with a well-designed network architecture and update rule. In particular, we design a dedicated embedding strategy to encode and adapt to different pilot patterns and noise levels, and utilize a special cross-attention mechanism to align the partial raw channel observations with the denoised channel at each time step of the generation process. This architecture effectively anchors the diffusion process, enabling the model to accurately recover full channel details from limited noisy observations. Comprehensive experimental results show that the proposed approach achieves a performance gain of over 5 dB compared to the baselines under varying noise conditions, and provides robust channel acquisition even under a sparse pilot density of 1/32 without significant performance loss compared to the denser pilot cases. Moreover, it is capable of generating high-quality channel matrices within just 10 inference steps, effectively balancing estimation accuracy with computational efficiency and inference speed. Ablation studies demonstrate the rationality of the model design and the necessity of its modules.
Abstract:Digital twin (DT) technology offers transformative potential for vehicular networks, enabling high-fidelity virtual representations for enhanced safety and automation. However, seamless DT synchronization in dynamic environments faces challenges such as massive data transmission, precision sensing, and strict computational constraints. This paper proposes an integrated sensing, computing, and semantic communication (ISCSC) framework tailored for DT-assisted vehicular networks in the near-field (NF) regime. Leveraging a multi-user multiple-input multiple-output (MU-MIMO) configuration, each roadside unit (RSU) employs semantic communication to serve vehicles while simultaneously utilizing millimeter-wave (mmWave) radar for environmental mapping. We implement particle filtering at RSUs to achieve high-precision vehicle tracking. To optimize performance, we formulate a joint optimization problem balancing semantic communication rates and sensing accuracy under limited computational resources and power budget. Our solution includes a hybrid heuristic algorithm for vehicle-to-RSU assignment and an alternating optimization approach for determining semantic extraction ratios and beamforming matrices. Performance is extensively evaluated via the Cramér-Rao bound (CRB) for angle and distance estimation, semantic transmission rates, and resource utilization. Numerical results demonstrate that the proposed ISCSC framework achieves a 20% improvement in transmission rate while maintaining the sensing accuracy of existing integrated sensing and communication (ISAC) schemes under constrained resource conditions.
Abstract:Large language models (LLMs) exhibit strong reasoning and conversational abilities, but ensuring reliable behavior in multi-turn interactions remains challenging. In many real-world applications, agents must succeed in one-shot settings where retries are impossible. Existing approaches either rely on reflection or post-hoc evaluation, which require additional attempts, or assume fully trainable models that cannot leverage proprietary LLMs. We propose an asymmetric actor-critic framework for reliable conversational agents. A powerful proprietary LLM acts as the actor, while a smaller open-source critic provides runtime supervision, monitoring the actor's actions and intervening within the same interaction trajectory. Unlike training-based actor-critic methods, our framework supervises a fixed actor operating in open-ended conversational environments. The design leverages a generation-verification asymmetry: while high-quality generation requires large models, effective oversight can often be achieved by smaller ones. We further introduce a data generation pipeline that produces supervision signals for critic fine-tuning without modifying the actor. Experiments on $τ$-bench and UserBench show that our approach significantly improves reliability and task success over strong single-agent baselines. Moreover, lightweight open-source critics rival or surpass larger proprietary models in the critic role, and critic fine-tuning yields additional gains over several state-of-the-art methods.
Abstract:In this paper, we investigate the performance of a fluid antenna relay (FAR)-assisted downlink communication system utilizing non-orthogonal multiple access (NOMA). The FAR, which integrates a fluid antenna system (FAS), is mounted on an autonomous aerial vehicle (AAV) and introduces extra degrees of freedom to improve the performance of the system. The transmission is divided into a first phase from the base station (BS) to the users and the FAR, and a second phase where the FAR forwards the signal using amplify-and-forward (AF) or decode-and-forward (DF) relaying to reduce the outage probability (OP) for the user with weaker channel conditions. To analyze the OP performance of the weak user, Copula theory and the Gaussian copula function are employed to model the statistical distribution of the FAS channels. Analytical expressions for the weak user's OP are derived for both the AF and the DF schemes. Simulation results validate the effectiveness of the proposed scheme, showing that it consistently outperforms benchmark schemes without the FAR. In addition, numerical simulations demonstrate the values of the relaying scheme selection parameter under different FAR positions and communication outage thresholds.
Abstract:Learning-based semantic communication (SemCom) has recently emerged as a promising paradigm for improving the transmission efficiency of wireless networks. However, existing methods typically rely on extensive end-to-end training, which is both inflexible and computationally expensive in dynamic wireless environments. Moreover, they fail to exploit redundancy across multiple transmissions of semantically similar content, limiting overall efficiency. To overcome these limitations, we propose a channel-aware generative adversarial network (GAN) inversion-based joint source-channel coding (CAGI-JSCC) framework that enables training-free SemCom by leveraging a pre-trained SemanticStyleGAN model. By explicitly incorporating wireless channel characteristics into the GAN inversion process, CAGI-JSCC adapts to varying channel conditions without additional training. Furthermore, we introduce a cache-enabled dynamic codebook (CDC) that caches disentangled semantic components at both the transmitter and receiver, allowing the system to reuse previously transmitted content. This semantic-level caching can continuously reduce redundant transmissions as experience accumulates. Extensive experiments on image transmission demonstrate the effectiveness of the proposed framework. In particular, our system achieves comparable perceptual quality with an average bandwidth compression ratio (BCR) of 1/224, and as low as 1/1024 for a single image, significantly outperforming baselines with a BCR of 1/128.
Abstract:Generating high-quality 360° panoramic videos from perspective input is one of the crucial applications for virtual reality (VR), whereby high-resolution videos are especially important for immersive experience. Existing methods are constrained by computational limitations of vanilla diffusion models, only supporting $\leq$ 1K resolution native generation and relying on suboptimal post super-resolution to increase resolution. We introduce CubeComposer, a novel spatio-temporal autoregressive diffusion model that natively generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned spatio-temporal order, reducing memory demands while enabling high-resolution output. Specifically, to address challenges in multi-dimensional autoregression, we propose: (1) a spatio-temporal autoregressive strategy that orchestrates 360° video generation across cube faces and time windows for coherent synthesis; (2) a cube face context management mechanism, equipped with a sparse context attention design to improve efficiency; and (3) continuity-aware techniques, including cube-aware positional encoding, padding, and blending to eliminate boundary seams. Extensive experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality, supporting practical VR application scenarios. Project page: https://lg-li.github.io/project/cubecomposer
Abstract:Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
Abstract:It is commonly believed that gradient compression in federated learning (FL) enjoys significant improvement in communication efficiency with negligible performance degradation. In this paper, we find that gradient compression induces sharper loss landscapes in federated learning, particularly under non-IID data distributions, which suggests hindered generalization capability. The recently emerging Sharpness Aware Minimization (SAM) effectively searches for flat minima by incorporating a gradient ascent step (i.e., perturbing the model with gradients) before the celebrated stochastic gradient descent. Nonetheless, the direct application of SAM in FL suffers from inaccurate estimation of the global perturbation due to data heterogeneity. Existing approaches propose to utilize the model update from the previous communication round as a rough estimate. However, the effectiveness of this estimate is hindered when model update compression is incorporated. To address this, we propose FedSynSAM, which leverages the global model trajectory to construct synthetic data and facilitates an accurate estimation of the global perturbation. The convergence of the proposed algorithm is established, and extensive experiments are conducted to validate its effectiveness.
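The SAM update mentioned in the abstract above (a gradient ascent perturbation followed by a descent step) can be sketched on a toy problem. This is a minimal illustration of standard single-machine SAM, not of FedSynSAM. The quadratic loss, the function name `sam_step`, and the values of `lr` and `rho` are assumptions chosen for the example.

```python
import numpy as np


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One Sharpness-Aware Minimization step on parameters w."""
    g = grad_fn(w)
    # Ascent step: perturb the weights along the normalized gradient direction,
    # with perturbation radius rho (small constant avoids division by zero).
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    # Descent step: evaluate the gradient at the perturbed point and apply it
    # to the original (unperturbed) weights.
    g_perturbed = grad_fn(w + eps)
    return w - lr * g_perturbed


# Toy quadratic loss L(w) = 0.5 * w^T A w, whose gradient is A w (assumed example).
A = np.diag([10.0, 1.0])


def grad_fn(w):
    return A @ w


single = sam_step(np.array([1.0, 0.0]), grad_fn)

w = np.array([1.0, 1.0])
for _ in range(100):
    w = sam_step(w, grad_fn)
```

In a federated setting each client only sees its own data, so the gradient used for `eps` is a local one. The abstract identifies the estimation of the global perturbation as the difficulty that FedSynSAM targets.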